Adapting LSI for Fine-Grained and Multi-Level Document Comparison

Authors

  • Nicholas Adelman
  • Marin Simina
Abstract

In recent years, Latent Semantic Indexing (LSI) has been recognized as an effective tool for Information Retrieval in text documents. The level of “granularity” in LSI (i.e. whether LSI is performed on documents, paragraphs, sentences, phrases, etc.) is somewhat of a limiting factor, in that LSI comparisons can only be made at the level of granularity chosen. Here we argue that, as long as a record of the document structure is maintained, the level of granularity may be arbitrarily fine while still allowing for comparison at any coarser granularity. It is shown that the reduced-dimension vector for any particular section of a document is a function of the vectors of its constituent subsections. Using this information, we illustrate how LSI can be used to compare documents at multiple structural levels. One possible application (automated plagiarism detection) is discussed as an example of how this method of multilevel comparison may be used to improve query time in fine-granularity LSI applications.
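The claim that a section's reduced-dimension vector is a function of its subsections' vectors can be illustrated with a minimal sketch. The abstract does not spell out the function, so the example below assumes raw term counts and the standard LSI fold-in projection, under which the relationship is a plain sum because the projection is linear. The toy matrix, the `structure` record, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_lsi(term_unit_counts, k):
    """Truncated SVD of a term-by-unit count matrix; returns (U_k, s_k)."""
    U, s, _ = np.linalg.svd(term_unit_counts, full_matrices=False)
    return U[:, :k], s[:k]

def fold_in(counts, U_k, s_k):
    """Project a raw term-count vector into the k-dimensional LSI space."""
    return (U_k.T @ counts) / s_k

def cosine(a, b):
    """Cosine similarity between two reduced-dimension vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy corpus (assumption): rows are terms, columns are sentences,
# i.e. the finest level of granularity.
sentences = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [1, 0, 0, 2],
], dtype=float)

U_k, s_k = fit_lsi(sentences, k=2)
sentence_vecs = np.stack(
    [fold_in(sentences[:, j], U_k, s_k) for j in range(sentences.shape[1])]
)

# Hypothetical record of document structure: paragraph -> its sentence indices.
structure = {"para_a": [0, 1], "para_b": [2, 3]}

# Coarser-level vectors built only from the stored sentence vectors.
paragraph_vecs = {
    name: sentence_vecs[idx].sum(axis=0) for name, idx in structure.items()
}

# Cross-check: folding in the paragraph's summed term counts directly
# gives the same vector, since the fold-in projection is linear.
direct_a = fold_in(sentences[:, structure["para_a"]].sum(axis=1), U_k, s_k)
paragraph_similarity = cosine(paragraph_vecs["para_a"], paragraph_vecs["para_b"])
```

Under these assumptions only the finest-granularity vectors and the structure record need to be stored; vectors for coarser levels can be derived on demand. If term weights are normalized per unit or passed through a non-linear weighting, the simple sum no longer holds exactly and the combining function would have to account for that.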

Similar resources

Supporting Document-Category Management: An Ontology-based Document Clustering Approach

Automated document-category management, particularly document clustering, represents an appealing alternative for supporting a user’s search, access, and utilization of ever-increasing textual corpora. Traditional document clustering techniques generally emphasize the analysis of document contents and measure document similarity on the basis of the overlap between or among the feat...

Semi-supervised latent variable models for sentence-level sentiment analysis

We derive two variants of a semi-supervised model for fine-grained sentiment analysis. Both models leverage abundant natural supervision in the form of review ratings, as well as a small amount of manually crafted sentence labels, to learn sentence-level sentiment classifiers. The proposed model is a fusion of a fully supervised structured conditional model and its partially supervised counterp...

Investigating the Effect of Sedimentary Basin on Consolidation of Kerman Fine-Grained Soils

In this research, the effects of a sedimentary basin, environmental conditions, and the passage of time on the consolidation processes and engineering characteristics of fine-grained soils in Kerman city were investigated. For this purpose, the natural consolidation curves of soil samples extracted from different locations of Kerman city were compared with the Kerman city intrinsic consolidation li...

Optimization of ECMAP parameters in production of ultra-fine grained Al1050 strips using Grey relational analysis

Production of lightweight metals with a higher strength-to-weight ratio has long been a main goal of researchers. In this article, the equal channel multi angular pressing (ECMAP) process, one of the most appealing severe plastic deformation (SPD) methods for the production of ultra-fine grained (UFG) materials, is studied. Two main routes, A and C, were investigated by FEM and compared with each other from differ...

A New Fine-Grained Weighting Method in Multi-Label Text Classification

Multi-label classification is one of the important research areas in data mining. In this paper, a new multi-label classification method using multinomial naive Bayes is proposed. We use a new fine-grained weighting method for calculating the weights of feature values in multinomial naive Bayes. Our experiments show that the value weighting method could improve the performance of multinomial nai...

Publication date: 2004